Abstract

Motivated by online settings where users can provide explicit feedback about the
relevance of products that are sequentially presented to them, we look at the
recommendation process as a problem of dynamically optimizing this relevance
feedback. Such an algorithm optimizes the fine tradeoff between presenting the
products that are most likely to be relevant, and learning the preferences of the user so
that more relevant recommendations can be made in the future.

We assume a standard predictive model inspired by collaborative filtering, in
which a user is sampled from a distribution over a set of possible types. For
every product category, each type has an associated relevance feedback that is
assumed to be binary: the category is either relevant or irrelevant. Assuming that the
user stays for each additional recommendation opportunity with probability $\beta$,
independent of the past, the problem is to find a policy that maximizes the expected
number of recommendations that are deemed relevant in a session.

We analyze this problem and prove key structural properties of the optimal
policy. Based on these properties, we first present an algorithm that strikes a balance
between recursion and dynamic programming to compute this policy. We further
propose and analyze two heuristic policies: a ‘farsighted’ greedy policy that
attains at least a $1 - \beta$ factor of the optimal payoff, and a naive greedy policy that
attains at least a constant factor of the optimal payoff in the worst case. Extensive
simulations show that these heuristics are very close to optimal in practice.


1 Introduction 

Predicting the preferences of users in order to present them with more relevant engagements is a
fundamental component of any recommendation system [?, ?]. Over the years, a wide variety of
approaches have been proposed for this problem (see [1] for a survey). These include content-based
approaches that rely on generating user and item profiles based on available data [22, 18],
collaborative filtering approaches [?, ?] that recommend items based on similarity measures between
users and/or items, and a combination of both [?]. In this paper, motivated by several settings
of interest in which explicit feedback about the relevance of the recommendations can be received
from the user on small timescales, we pursue a less studied approach (see [28]) of modeling the
recommendation process as a sequential optimization problem. Below are a few examples of such
settings.

• Online retail: A user enters an online shopping portal to purchase an accessory, e.g. a 
watch. She is sequentially presented with various design choices and based on her feedback 
to these designs, the system adaptively presents recommendations that are more likely to 
be liked by her. 

• Online media-on-demand services: A user using an online music-on-demand service would
like to find a new genre of music to listen to. Short sound-clips are played for her
sequentially, and based on the feedback that she provides for these clips, the recommendation
system seeks to adaptively find genres that are better suited to her tastes.

• Advertising in online video: As video ads are inherently more disruptive of a user’s
attention, and thus potentially more valuable than sponsored search ads, there is a strong
motivation for designing ad allocation mechanisms that take into account the relevance of
these ads to the users. Services like YouTube and Hulu collect explicit feedback about the
relevance of an ad after it is shown, and this feedback can be used to adaptively learn the
preferences of the users and show more relevant ads.

We consider a model that is derived from cluster models for collaborative filtering (see [?]) in which
the history of user behaviors is compressed into a predictive model, where users are classified into
‘types’ that capture the preference profile of the user. A typical recommendation generation
algorithm dynamically observes user behavior and uses maximum-likelihood estimates based on this
predictive model to choose products that are more likely to be relevant. Our approach replaces
this maximum-likelihood estimation with a sophisticated optimization problem, in which the two
conflicting goals of presenting the most relevant products based on current predictions of user
preferences, and learning the underlying type of the user so that more relevant engagements can be
shown later, are concurrently optimized in a precise and systematic way.

Our model assumes that a user that enters the system is sampled from a probability distribution
over a set of types that is a priori known to the system designer. Each type is associated with
a string of ‘relevance’ ratings for the different categories of products. We focus on the simplest
case in which this relevance rating is binary, i.e. the user considers a category of products either
relevant or irrelevant. We assume that the number of recommendation opportunities available in a
session is random, modeled as a geometric random variable arising from the assumption that the
user stays for each additional opportunity with a fixed probability $\beta$, independent of the past. Under
this setting we focus on the problem of adaptively maximizing the expected cumulative relevance of
recommendations presented to the user during the session.

Our main contribution in this paper is the analysis of this sequential relevance maximization
problem. At first glance, one can see that the optimal policy can be determined using a naive recursive
algorithm. But as is typical of such algorithms, it is highly inefficient due to repetition of
redundant work. The standard tool to solve such problems is dynamic programming, which turns these
inefficient recursive algorithms into efficient iterative solutions. But unfortunately in our case, the
state space for this program grows exponentially in the number of types and categories. Further, an
efficient enumeration of these states is difficult.

We first derive certain key properties of the structure of the optimal policy using probabilistic
interchange arguments. Using these properties, we provide an algorithm that strikes a balance between
recursion and dynamic programming to solve for the optimal policy. Unfortunately, this algorithm
still remains computationally prohibitive. Motivated by our structural results, we then propose and
analyze two heuristic policies: a ‘farsighted’ greedy policy that is easier to compute and a naive
greedy policy that is analogous to the maximum-likelihood prediction performed by typical
recommendation systems. We then prove that these policies are approximately optimal, i.e. they achieve
a constant factor of the optimal payoff. Finally, we perform extensive simulations on random
problem instances and observe that these heuristic policies typically perform much better than
predicted by our worst-case bounds.


1.1 Related work 

The idea of posing the recommendation process as an optimization problem is not new. To the best
of our knowledge, its earliest appearance in literature can be traced back to [5], which proposed a
decision-theoretic modeling of the problem of generating recommendations (on a palm-top) for a
user navigating through an airport. [28] proposed a framework for modeling the sequential
optimization problem in online recommendation systems as a Markov Decision Process (MDP) [23].
Their underlying formulation is quite general and their focus is on defining and establishing this
paradigm. The model that we consider on the other hand is more structured and our focus is on the
analysis of the resulting optimization problem.

The sequential relevance maximization problem is closely related to Bayesian multi-armed
bandit problems. In a multi-armed bandit problem (MAB), first introduced by Thompson in [29], a
decision-maker faces a set of arms whose reward characteristics are uncertain and seeks to optimize
the sequence in which they are pulled so as to maximize some long-run reward. In such problems
one faces the tradeoff between exploration, i.e. learning the reward characteristics of the arms, and
exploitation, i.e. accumulating rewards by choosing good arms based on current estimates. These
problems have been commonly studied under two distinct settings: Bayesian and stochastic, with a
different set of analytical approaches used in each. Our model falls in the Bayesian setting [?, ?],
in which an initial prior distribution is assumed over the parameters of a probabilistic
reward-generating model for each arm, and one performs Bayesian updates of these estimates as rewards are
observed. One then solves the well-defined problem of maximizing either the long-term average or
discounted cost. The standard solution tool in this case is dynamic programming. The stochastic
setting [?, ?] (also see [?] for a recent survey) does not assume any prior distribution over the
parameters and one instead tries to find policies that minimize the worst-case rate at which losses
relative to the expected reward of the best arm (called ‘regret’) are accumulated [?]. The focus is on
characterizing this optimal rate.

Most of the literature in these settings has focused on the case where the rewards of different arms
are statistically independent. In the Bayesian case, a seminal result by Gittins [?] shows that the
optimal policy dynamically computes an index for each arm independently of all other arms, and
picks the arm with the highest index at each step. But in our case the relevance of different products
is correlated through the hidden user type, and hence our problem is a Bayesian MAB problem
with correlated or dependent arms. It is well known that the decomposition result of Gittins does
not hold for this case. Over the years there has been sporadic progress in tackling this problem,
with most papers focusing on specific models. [?] and [?] analyze two-armed bandit problems in
which the reward characteristics of the two arms are known, but which arm corresponds to which reward
distribution is not known, which leads to a natural dependence between the arms. The general
case of more than two arms still remains open. [?] studies another version of the
problem in which the arms can be grouped into clusters of dependent arms, in which case the Gittins
decomposition result can be partially extended. Recently, [19] considered a specific model of a MAB
problem with dependent arms, where they analyzed the performance of a greedy policy and derived
asymptotic optimality results. These types of problems have recently also gained attention in the
stochastic setting [?, 10, 26] (also see [?] for the case of binary rewards), although the formulations
and techniques in that setting are very different. The broad conclusion from this body of work is
that the correlation between arms can be exploited to achieve better regret rates.

Another important difference between MAB problems and our problem is that, since any product can
be presented only once and since there is a finite number of products in any category, there is a bound
on the number of times each ‘arm’ can be pulled. Thus one cannot ‘exploit’ an arm forever and is
forced to experiment intermittently. In the special case when each category has a single product, our
problem is also related to the active sequential hypothesis testing problem [9, 20]. In this problem,
one seeks to speedily learn a hidden random variable by adaptively choosing a sequence of correlates
to observe, with a cost for each observation. This formulation would have been appropriate if our
objective were to quickly learn the user type without any concern for the relevance feedback. But
since our goal is to optimize the latter, a different approach is necessary.


1.2 Structure of the paper 


The structure of the paper is as follows. In Section 2 we introduce our model and define the
relevance optimization problem. Section 3 is devoted to the analysis of this problem, in which we derive
key structural properties of the optimal policy and finally present an algorithm to compute it. In
Section 4 we propose two policies which are easier to compute and prove that they are approximately
optimal. In Section 5 we extensively simulate our two approximately optimal policies on randomly
generated problem instances and compare their performance to the optimal policy. Finally, Section
6 summarizes our work and discusses extensions to our model. The proofs of all our results can be
found in the appendix.

Figure 1: A sample relevance matrix with 4 product categories (A,B,C,D) and 4 types (1,2,3,4)


2 Model 

We consider the setting of a user who enters an online system and is sequentially presented with
products from different categories, with the goal of maximizing the number of relevant products
presented to him before he eventually leaves the system. Assume that there are $L$ total products.
The products are divided into categories, with each category representing a set of similar products.
Let these categories be labeled as $j \in \{1, \dots, H\} = [H]$. Each category $j$ has $L_j$ products. A given
user considers some set of categories to be relevant to him and this set is not known a priori. The
system designer elicits explicit feedback about the relevance of a product after it is presented. This
feedback is obtained as an answer to an explicit question, is assumed to be binary, and takes value 1
(resp. 0) when the product is relevant (resp. irrelevant). We assume that this feedback is accurately
provided by the user. A product cannot be presented more than once to the same user during the
session. Hence the maximum number of products that can be shown is restricted to $L$.

We capture the uncertainty in the preferences of the user by assuming that the user is one of $N$
possible types and the actual type of the user is a latent random variable that is not observed at the
beginning of the session. Let $X \in [N]$ denote this random variable. Let $p_X$ be the corresponding
probability distribution. We assume that the system designer only knows this distribution $p_X$. For
each user type $i$ and for each product category $j$, let $q^i_j \in \{0, 1\}$ denote the fixed binary relevance
feedback of the user of type $i$ to that category. The type of the user is not known, and so for each
category $j$, we introduce a random variable $Y_j \in \{0, 1\}$ which represents the binary feedback of a
user for any product in that category. $p_X$ induces a probability distribution on $Y_j$:

$$P(Y_j = 1) = \sum_{i=1}^{N} q^i_j \, p_X(i).$$

It is convenient to associate each user type $i \in [N]$ with an $H$-length binary vector of the $\{q^i_j\}$,
$j \in [H]$ values for different categories. Hence we can define an $N \times H$ relevance matrix $Q = \{q^i_j\}$,
whose rows represent user types, and columns represent product categories.¹ Figure 1 is an example
of a relevance matrix with four types of users labeled 1 to 4 and four product categories labeled A
to D. Each category has some specified number of products. For instance, type 1 finds categories A
and C relevant and finds B and D irrelevant.
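
The induced distribution on $Y_j$ can be checked numerically. The sketch below is illustrative only: the matrix `Q` is a hypothetical stand-in chosen to agree with what the text says about Figure 1 (the figure itself is not reproduced here), and the uniform prior `p_X` is an assumption made for the example.

```python
# Hypothetical relevance matrix (rows: types 1-4, columns: categories A-D).
# It is an assumption chosen to be consistent with the textual description
# of Figure 1, not a copy of the figure.
CATEGORIES = ["A", "B", "C", "D"]
Q = [
    [1, 0, 1, 0],  # type 1: A and C relevant, B and D irrelevant
    [1, 1, 0, 0],  # type 2
    [1, 1, 0, 1],  # type 3
    [0, 0, 1, 1],  # type 4
]
p_X = [0.25, 0.25, 0.25, 0.25]  # assumed uniform prior over types


def marginal_relevance(Q, p_X):
    """Return [P(Y_j = 1)] for every category j, i.e. sum_i q^i_j * p_X(i)."""
    n_categories = len(Q[0])
    return [
        sum(p_X[i] * Q[i][j] for i in range(len(Q)))
        for j in range(n_categories)
    ]


for name, prob in zip(CATEGORIES, marginal_relevance(Q, p_X)):
    print(f"P(Y_{name} = 1) = {prob:.2f}")
```

Each entry is simply the prior mass of the types whose row has a 1 in that column.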

The number of display opportunities that are available before the user leaves the system is modeled
as a random variable $C \in \{1, 2, \dots\}$ with a geometric probability distribution $p_C$ where
$p_C(m) = \beta^{m-1}(1 - \beta)$ for $m \ge 1$. In other words, the user dynamics in the system are modeled as a memoryless
random process, in which a user stays for each additional opportunity with probability $\beta$ or exits with
probability $1 - \beta$, independently of the past. This assumes that at least one opportunity is always
available. Finally, the random variable $C$ is independent of the user type $X$. The feedback for a
product can be obtained after every display opportunity, but since the feedback for a product is the
same for every other product in its category, one can assume that the feedback is requested and
obtained only when the product presented belongs to a category that has not been shown before.

1 Notice that the type space is quite general. If, for a user, there is a joint distribution over finding different
categories relevant, then we can think of the user as being a convex combination of the types corresponding to the
realizations of binary relevance vectors with the associated probabilities. Also, if some variation is observed in
the feedback received for products that belong to the same category, then each of the products can be declared
as individual categories. Although this increases the problem size, our analysis remains applicable.


2.1 Relevance maximization 


The primary objective of the system designer is to maximize the expected number of relevant
products presented to a user in the session. Once a user enters the website, at each display opportunity,
the system designer adaptively decides which product should be shown to the user, while taking all
the user feedback obtained in the past into consideration. We define the objective formally. A policy
$\psi$ for the designer is the sequence of maps $\psi = \{\psi_1, \dots, \psi_L\}$ where each map $\psi_t : H_t \to A_t$ is a
mapping from the set of possible observations of user feedback until time $t$, denoted by $H_t$, to the
set of possible actions $A_t$, which is the set of choices of products. Let $\Psi$ be the set of all feasible
policies. The objective of the designer is to find a policy which maximizes the expected number of
relevant products shown in a session under the constraint that no product is shown more than once. Let
$l_t$ denote the product chosen at time $t$. Once a policy $\psi \in \Psi$ is chosen, $l_t$ is a well defined random
variable. With some abuse of notation, let $j(l_t)$ be its category. Then the objective of the designer
is the following.

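$$\max_{\psi \in \Psi} \; E\left[\sum_{t=1}^{C} Y_{j(l_t)}\right] \quad \text{subject to} \quad \sum_{t=1}^{C} \mathbf{1}\{l_t = l\} \le 1 \ \text{ for each product } l \in [L]. \qquad (1)$$

Assuming memoryless user dynamics, the optimization problem (1) takes the following form

$$\max_{\psi \in \Psi} \; E\left[\sum_{t=1}^{\infty} \beta^{t-1} \, Y_{j(l_t)}\right]. \qquad (2)$$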
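
The passage from the random-horizon objective to the discounted one can be sketched in a few lines. The sketch relies on two facts from the model, namely $P(C \ge t) = \sum_{m \ge t} \beta^{m-1}(1-\beta) = \beta^{t-1}$ and the independence of $C$ from the user type $X$. It also assumes, for this sketch only, that the product shown at time $t$ depends solely on feedback received before $t$, and that the summand is taken to be zero once all $L$ products have been shown:

$$E\left[\sum_{t=1}^{C} Y_{j(l_t)}\right] = E\left[\sum_{t=1}^{\infty} \mathbf{1}\{C \ge t\} \, Y_{j(l_t)}\right] = \sum_{t=1}^{\infty} P(C \ge t) \, E\left[Y_{j(l_t)}\right] = \sum_{t=1}^{\infty} \beta^{t-1} E\left[Y_{j(l_t)}\right].$$

The survival probability $\beta$ therefore plays the role of a discount factor.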


As mentioned earlier, this problem is a type of Bayesian multi-armed bandit problem with
correlated rewards (see [?, ?]) with an additional constraint on the number of times each arm may
be pulled. At first glance, one can solve this problem using the recursive program in
Algorithm 1. But it is well known that such recursive algorithms can be very inefficient: recursion
repeats work whenever sub-problems overlap, which is unfortunately the case here. Turning these
inefficient recursive algorithms into
efficient iterative algorithms is the role of dynamic programming. This requires us to define a state
space of possible ‘information states’ for each opportunity $t$, which encapsulate all the information
that has been gained till time $t$. In our case, the information state corresponds to a smaller relevance
matrix obtained after computing the posterior distribution on the types, by eliminating all the rows
corresponding to user types that have conditional probability 0 and all the columns corresponding to
categories that have been exhausted. The state space thus grows prohibitively large with time and its
enumeration is cumbersome. In the next section, we prove some structural properties of the optimal
policy and based on these we provide a more efficient algorithm that strikes a balance between recursion
and iteration in order to compute this policy. These structural results are motivated by the following
examples.

Example: A triangular relevance matrix: Consider the relevance matrix shown in Figure 2. A
quick inspection convinces us that the optimal policy is one which shows the categories in the
order A, B, C and then D. If a positive feedback is obtained for a category then all the products in
that category are exhausted. To see this, observe that this policy attains the optimal payoff obtained
in the case that the type of the user is known at arrival. Structurally, there is a partial order relation
on the categories where one category ‘dominates’ the other if the set of types which find it relevant
is a strict superset of the set of types which find the other relevant. This example shows that if this
partial order relation leads to a complete ordering of the categories then the optimal policy simply
presents the categories according to this order. But what if that is not the case? In Lemma 3.2, we prove
an appropriate generalization of this property for arbitrary relevance matrices using a probabilistic
interchange argument. We show that if a category dominates some other then in the optimal policy
it is presented before the other.

Figure 2: A triangular relevance matrix. The optimal policy is to present categories in the order A,
B, C and then D.

Algorithm 1 (Optimal) Function $[V(Q, p, \beta), A(Q, p, \beta)]$ where $Q$ is a relevance matrix and $p$ is a
probability distribution over user types.

• If $Q$ is empty, return $V(Q, p, \beta) = 0$.

• For a category $j$, let $M_j$ denote the set of user types which find $j$ relevant and let $P(M_j) =
P(X \in M_j)$. Also let $Q^j$ be the matrix obtained after removing the column corresponding
to category $j$ and the rows corresponding to all the user types not in $M_j$, and let $Q^j_{res}$ be
the matrix obtained after removing the column corresponding to category $j$ and the rows
corresponding to all the user types in $M_j$. Finally, let $p^j$ denote the distribution on the user
types conditional on the event $\{X \in M_j\}$ and $p^j_{res}$ be the distribution on the user types
conditional on $\{X \notin M_j\}$.

• Then define

$$V_j = P(M_j)\left(\frac{1 - \beta^{L_j}}{1 - \beta} + \beta^{L_j} \, V(Q^j, p^j, \beta)\right) + (1 - P(M_j)) \, \beta \, V(Q^j_{res}, p^j_{res}, \beta).$$

• Return

$$V(Q, p, \beta) = \max_j V_j, \qquad A(Q, p, \beta) \in \arg\max_j V_j.$$
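
The recursion of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function name `solve`, the representation of $Q$ as nested tuples, and the memoization over (surviving types, remaining categories) are choices made here. The sketch also assumes $0 < \beta < 1$, a strictly positive prior, and that after positive feedback all $L_j$ products of the category are shown consecutively.

```python
from functools import lru_cache


def solve(Q, p, L, beta):
    """Return (optimal expected payoff, first category to present).

    Q: tuple of rows, Q[i][j] in {0, 1}; p: prior over types (all > 0);
    L: number of products per category; beta: stay probability in (0, 1).
    """

    @lru_cache(maxsize=None)
    def V(types, cats):
        # Empty matrix: nothing left to learn or to show.
        if not types or not cats:
            return 0.0, None
        mass = sum(p[i] for i in types)  # normaliser of the posterior
        best_value, best_cat = -1.0, None
        for j in sorted(cats):
            rel = frozenset(i for i in types if Q[i][j] == 1)  # M_j
            res = types - rel                                  # complement
            rest = cats - {j}
            prob = sum(p[i] for i in rel) / mass               # P(M_j)
            value = 0.0
            if rel:
                # Positive feedback: L_j relevant products in a row.
                run = (1 - beta ** L[j]) / (1 - beta)
                value += prob * (run + beta ** L[j] * V(rel, rest)[0])
            if res:
                # Negative feedback: one opportunity spent, no reward.
                value += (1 - prob) * beta * V(res, rest)[0]
            if value > best_value:
                best_value, best_cat = value, j
        return best_value, best_cat

    return V(frozenset(range(len(Q))), frozenset(range(len(L))))


# Hypothetical triangular instance: category j is relevant to types >= j.
triangular = ((1, 0, 0, 0), (1, 1, 0, 0), (1, 1, 1, 0), (1, 1, 1, 1))
value, first = solve(triangular, (0.25,) * 4, (1, 1, 1, 1), 0.5)
print(value, first)
```

Memoization removes repeated work on identical sub-matrices, but the number of distinct (types, categories) pairs can still grow exponentially, which is the difficulty the text describes.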

Example: A permutation relevance matrix: Consider the relevance matrix shown in Figure 3.
One can argue that in this case the optimal policy is greedy: choose the category with the maximum
expected number of relevant products. In fact, if the relevance matrix is a permutation of smaller block
matrices, with multiple categories in each, we can consider the relevance optimization problem for
each of the smaller blocks separately and greedily choose the order in which these blocks are chosen.


3 Characteristics of the optimal allocation policy 

In this section we present some structural properties of the optimal allocation policy. 

Figure 3: A diagonal relevance matrix. The greedy policy is optimal.


3.1 Property 1: If category A is relevant, show it 

We first present the following intuitive property. 

Lemma 3.1. In the optimal allocation policy, at any opportunity, conditional on the past
observations, if there exists a product category $j$ that will generate a positive feedback with probability 1,
i.e. $P(Y_j = 1 \mid H_t) = 1$, then any product in $j$ that has not been shown is allotted immediately. If
there are multiple such products then they can be allotted in any order.

This property implies that if a positive feedback is received for a product belonging to a particular
category $j$, then all $L_j$ products of that category are scheduled to be presented in the immediately
following opportunities.² The proof uses a simple probabilistic interchange argument.
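
As a small worked computation under the discounted form of the objective: suppose the first product of category $j$ is shown at opportunity $t$, receives positive feedback, and the remaining $L_j - 1$ products of the category follow at opportunities $t+1, \dots, t+L_j-1$. Each of these products is relevant, so the contribution of the category to the expected payoff is

$$\beta^{t-1}\left(1 + \beta + \cdots + \beta^{L_j - 1}\right) = \beta^{t-1} \, \frac{1 - \beta^{L_j}}{1 - \beta},$$

and the next decision is discounted by a further factor of $\beta^{L_j}$.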

3.2 Property 2: If ‘likes A’ implies ‘likes B’, then show B before showing A

To describe this next property, we first formally define a few ideas. In the dynamic allocation
of products to the opportunities, we call an opportunity $t$ an experimentation opportunity if,
conditional on information obtained until time $t - 1$, there is not a single category $j$ such that
$Y_j = 1$ with probability 1. If there existed such a category, the previous lemma tells us to exhaust all
the products in that category. But since there is no such category, an experimentation opportunity
brings to us the non-trivial problem of deciding which category to present to the user next. Thus all
the non-trivial decisions in the optimal dynamic allocation policy are taken at the experimentation
opportunities. Let $S(t) = \{i \in [N] : P(X = i \mid H_t) > 0\}$ be the set of user types that have a
non-zero probability conditional on the history. Then note that after observing the feedback from
the allocation made at an experimentation opportunity, $|S(t - 1)| - |S(t)| \ge 1$. Let $E(t)$ be the set of
categories available, i.e. which have not been presented till opportunity $t$. Let $Q(t)$ be the relevance
matrix with rows corresponding to the types in $S(t)$ and the columns corresponding to the categories
in $E(t)$. Finally, for each category $j$ in $E(t)$, let $M_j(t) = \{i \in S(t) : q^i_j = 1\}$, which is the set of
user types in $S(t)$ which find category $j$ relevant.

Definition 3.1. We say that category $j$ dominates category $j'$ at opportunity $t$ if $M_{j'}(t) \subset M_j(t)$.
The categories that are not dominated by any other category are called non-dominated categories.

For instance, in Figure 1, A, C and D are the only non-dominated categories since A dominates B.
Then we show the following.

Lemma 3.2. In the optimal allocation policy, at any experimentation opportunity, the product
presented must be of a non-dominated category.


2 In order to not bore the user, we can introduce a bound on the number of products of the same category 
that can be successively shown to the user. 

In other words, this lemma says that if the set of user types which find category A relevant is
contained in the set of user types which find category B relevant, then in the optimal policy, category
B is presented before category A. The proof of this lemma also uses a probabilistic interchange
argument. Observe that the claim in the lemma is not an intuitively obvious fact. One may argue
that in some cases, presenting a category that is dominated may help us learn the true user type faster
and thus perform a better allocation in the future opportunities. Indeed if the goal is to minimize the
expected number of opportunities taken to learn the user type exactly, then this property clearly does
not hold (e.g. presenting a category that every user type finds relevant gives no information about
the true type).
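
The dominance relation is easy to compute from the relevance matrix. The following sketch is illustrative: the matrix `Q` is a hypothetical stand-in consistent with the textual description of Figure 1 (the figure is not reproduced here), indices are 0-based, and the function names are ours.

```python
# Hypothetical matrix consistent with the description of Figure 1
# (rows: types 1-4, columns: categories A-D); an assumption, not the figure.
Q = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]


def relevant_types(Q, types, j):
    """M_j: the surviving types that find category j relevant."""
    return frozenset(i for i in types if Q[i][j] == 1)


def non_dominated(Q, types, categories):
    """Categories whose M_j is not a strict subset of any other M_j'."""
    M = {j: relevant_types(Q, types, j) for j in categories}
    return [
        j for j in categories
        if not any(M[j] < M[other] for other in categories)  # '<' is strict subset
    ]


names = "ABCD"
print([names[j] for j in non_dominated(Q, range(4), range(4))])
```

On this matrix, B is dropped because the types that like B are a strict subset of those that like A.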

Now let $U(t)$ be a generic class of non-dominated categories that satisfy the condition that $M_j(t) =
M_{j'}(t)$ for all $j, j' \in U(t)$. This means that $U(t)$ is a class of categories found relevant by
exactly the same set of types. $U(t)$ will be called a non-dominated equivalence class of categories and
$M_{U(t)}$ denotes the set of types which find the class $U(t)$ relevant. We allow for a class to be a
singleton in the definition, and so suppose there are $K(t)$ such non-dominated equivalence classes
$\{U_1, \dots, U_{K(t)}\}$ that partition the set of non-dominated categories in the relevance matrix. Let this
set of non-dominated equivalence classes of categories be denoted by $\mathcal{U}(t)$. If furthermore the sets of
types $\{M_{U_1}, \dots, M_{U_{K(t)}}\}$ are mutually disjoint, then we say that the set of non-dominated
equivalence classes partition the type space. In this case, the relevance matrix can be represented as a block
diagonal matrix composed of $K(t)$ smaller block matrices (up to permutation of the $K(t)$ blocks),
with each block matrix corresponding to a non-dominated equivalence class. Such a small block is
composed of columns of all 1s, one for each category in the class, and columns corresponding to the
categories that the class dominates.

As products are presented and we recompute the relevance matrix after each feedback, we may
lose non-dominated categories or new categories may become non-dominated. Thus the set of
non-dominated equivalence classes will change. But in the case where new categories are added to a
class of non-dominated categories, we want to be able to identify the new class with the old class.
This can be done since the categories in an equivalence class in the relevance matrix at the first
display opportunity will continue to remain in the same class as long as they are non-dominated
and they have not been presented. Thus a class $U$ in subsequent display opportunities is identified
by equivalence to the set of categories in $U$ at the first display opportunity. For example, in the
relevance matrix in Figure 1, as mentioned before, $U_1 = \{A\}$, $U_2 = \{C\}$ and $U_3 = \{D\}$ are the
non-dominated classes at the first opportunity. Suppose C is presented and a negative feedback
is received. Then in the new relevance matrix obtained after deleting the rows corresponding to type
1 and type 4, and the column corresponding to category C, the only remaining non-dominated class is
$\{A, B\}$. In this case we identify $\{A, B\}$ with $U_1$, which was the class that contained A in the first
opportunity. Similarly, if A is presented initially and a negative feedback is received, then $\{C, D\}$ is left as
the only non-dominated equivalence class, which is a result of merging classes $\{C\}$ and $\{D\}$. In
this case the new class is identified with any of the original classes $U_2$ or $U_3$. This brings us to the
following property of any relevance matrix that can be easily verified.


Lemma 3.3. Consider a relevance matrix with an initial set of non-dominated classes of categories
$\mathcal{U}$. Suppose that a category from a class $U \in \mathcal{U}$ is presented. Suppose that a negative feedback is
received for this category, and consider the new relevance matrix obtained after deleting the rows
corresponding to user types that find the presented category relevant and the column
corresponding to the presented category. Then the new set of non-dominated equivalence classes of categories
$\mathcal{U}'$ satisfies $\mathcal{U}' \subseteq \mathcal{U}$.


Intuitively this is because, when a negative feedback is obtained at some opportunity $t$, the rows
corresponding to the user types that provide positive feedback to the shown category get deleted and
thus it cannot happen that a category that was dominated at opportunity $t$ becomes non-dominated
at $t + 1$. On the other hand, after a positive feedback, completely new non-dominated equivalence
classes can appear in the new relevance matrix computed after the posterior update. For example, if
A is presented in Figure 1 and a positive feedback is received, the new relevance matrix has positive probability
only on types 1, 2 and 3. In that case, D is dominated by B and hence $(\{B\}, \{C\})$ is the new set
of non-dominated equivalence classes (B and C are not equivalent). Notice that $\{B\}$ appears as a
new (singleton) class.
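
The three updates discussed above can be reproduced with a short sketch. As before, `Q` is a hypothetical matrix chosen to agree with the textual description of Figure 1 (not a copy of the figure), indices are 0-based (category A is 0, type 1 is 0), and the helper names are ours.

```python
# Hypothetical matrix consistent with the description of Figure 1
# (rows: types 1-4, columns: categories A-D); an assumption, not the figure.
Q = [
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
]


def update(Q, types, categories, shown, feedback):
    """Keep the types consistent with the feedback and drop the shown category."""
    kept_types = [i for i in types if Q[i][shown] == feedback]
    kept_categories = [j for j in categories if j != shown]
    return kept_types, kept_categories


def non_dominated_classes(Q, types, categories):
    """Group non-dominated categories by their set of relevant types."""
    M = {j: frozenset(i for i in types if Q[i][j] == 1) for j in categories}
    classes = {}
    for j in categories:
        if not any(M[j] < M[other] for other in categories):  # strict subset test
            classes.setdefault(M[j], []).append(j)
    return sorted(classes.values())


all_types, all_cats = [0, 1, 2, 3], [0, 1, 2, 3]
for shown, feedback in [(2, 0), (0, 0), (0, 1)]:
    types, cats = update(Q, all_types, all_cats, shown, feedback)
    print(shown, feedback, non_dominated_classes(Q, types, cats))
```

The two negative-feedback cases only merge or shrink existing classes, in line with Lemma 3.3, while the positive-feedback case produces the new singleton class containing B.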

3.3 Structure of the optimal policy


Lemmas 3.1, 3.2 and 3.3 reveal the following structure of the optimal policy. Beginning from
a set of non-dominated equivalence classes of categories, these classes are presented in a certain
order as long as we keep getting a negative feedback. If any class obtains a positive feedback in the
process, then we present all the products in that class, ‘zoom in’ to the next level (eliminating all
the other types from the relevance matrix) and restart with a new set of non-dominated equivalence
classes. Utilizing this structure, the following Algorithm 2 computes the optimal payoff.


Algorithm 2 (Optimal) Function $V(Q, p, \beta)$ where $Q$ is a relevance matrix and $p$ is a probability
distribution over user types.

• If $Q$ is empty, return $V(Q, p, \beta) = 0$.

• If $Q$ is non-empty, enumerate the non-dominated equivalence classes of $Q$. Let them
be $(U_1, \dots, U_K)$. Calculate the number of products in each class, denoted by
$L_k = \sum_{j \in U_k} L_j$.

• For each $k = 1, \dots, K$, and each $\pi \subseteq \{1, \dots, K\}$ such that $k \notin \pi$, let $\omega(\pi, k)$ be the
event $\{X \in S(\pi, k)\}$ where

$$S(\pi, k) = \{i \in [N] : q^i_j = 0 \ \forall j \in U_s, \ s \in \pi \ \text{ and } \ q^i_j = 1 \ \forall j \in U_k\}.$$

$S(\pi, k)$ is thus the set of user types that find all the classes with labels in $\pi$ irrelevant,
but find the class $U_k$ relevant. Let $Q^\pi_k$ be the relevance matrix obtained from deleting all
the rows corresponding to user types in $S(\pi, k)^c$ and all columns corresponding to the
categories in the classes in $\pi$ and the categories in $U_k$. Finally, let $p^\pi_k$ be the probability
distribution on the user types conditional on the event $\omega(\pi, k)$. Then define

$$V^\pi_k = P(\omega(\pi, k)) \left(\frac{1 - \beta^{L_k}}{1 - \beta} + \beta^{L_k} \, V(Q^\pi_k, p^\pi_k, \beta)\right).$$

• Return

$$V(Q, p, \beta) = OPT = \max_{(k_1, \dots, k_K) \in \sigma(1, \dots, K)} \left[ V^{\emptyset}_{k_1} + \beta \, V^{\{k_1\}}_{k_2} + \cdots + \beta^{K-1} \, V^{\{k_1, \dots, k_{K-1}\}}_{k_K} \right]. \qquad (3)$$

Here $\sigma(1, \dots, K)$ is the set of permutations of the $K$ non-dominated equivalence classes.


The optimization problem (3) is defined on the domain of all the possible orderings of the
non-dominated equivalence classes of categories in $Q$. This problem can be solved more efficiently
using dynamic programming as opposed to comparing all the possible $K!$ orderings. One can define
the state of the program at step $r$ as the set of classes $\{k_1, k_2, \dots, k_r\}$ that have been presented till
step $r$. A substantial reduction in the state space comes from the fact that $Q_k^\pi$ for any $k$ does not
depend on the order in which the classes in $\pi$ were presented, and hence the state of the program at
any step needs to only remember this set. Thus the size of the state space is
$\binom{K}{0} + \binom{K}{1} + \dots + \binom{K}{K} = 2^K$.

At each state at step $r$, in the worst case, $K - r$ sub-programs need to be called to evaluate
the set of payoffs-to-go corresponding to the classes that have not been presented, i.e.
$\{ V(Q_k^{\pi_r}, p_k^{\pi_r}, \beta) : k \notin \pi_r \}$. Thus at each level in the recursive program, the number of
sub-programs that are called is exponential in the number of non-dominated equivalence classes at
that level in the worst case. Considering this, we turn to find good heuristic policies that are easier
to compute.
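
The subset-based dynamic program described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the data layout (`Q[i][j]` binary, `p[i]` prior of type `i`, `L[j]` products in category `j`), the function names, and the strict-subset test used for domination are all choices made for this sketch.

```python
from functools import lru_cache


def optimal_payoff(Q, p, L, beta):
    """Sketch of V(Q, p, beta) using a DP over sets of already-rejected classes."""

    def geo(n):
        # 1 + beta + ... + beta^(n-1), i.e. (1 - beta^n) / (1 - beta)
        return sum(beta ** t for t in range(n))

    def classes(types, cats):
        # Group categories by the set of remaining types that find them relevant,
        # then drop groups whose type set is strictly contained in another's.
        groups = {}
        for j in cats:
            key = frozenset(i for i in types if Q[i][j] == 1)
            groups.setdefault(key, []).append(j)
        return [(S, js) for S, js in groups.items()
                if not any(S < T for T in groups)]

    def V(types, cats):
        mass = sum(p[i] for i in types)
        if not cats or mass == 0:
            return 0.0
        cls = classes(types, cats)
        K = len(cls)

        @lru_cache(maxsize=None)
        def G(pi):
            # pi: frozenset of class labels already shown with negative feedback.
            if len(pi) == K:
                return 0.0
            ruled_out = set().union(*(cls[s][0] for s in pi))
            shown = {j for s in pi for j in cls[s][1]}
            best = 0.0
            for k in range(K):
                if k in pi:
                    continue
                S = cls[k][0] - ruled_out            # S(pi, k)
                prob = sum(p[i] for i in S) / mass   # P(omega(pi, k))
                v = 0.0
                if prob > 0:
                    Lk = sum(L[j] for j in cls[k][1])
                    rest = tuple(j for j in cats
                                 if j not in shown and j not in cls[k][1])
                    v = prob * (geo(Lk) + beta ** Lk * V(tuple(sorted(S)), rest))
                best = max(best, v + beta * G(pi | {k}))
            return best

        return G(frozenset())

    all_types = tuple(i for i in range(len(p)) if p[i] > 0)
    return V(all_types, tuple(range(len(L))))
```

The memoized function `G` is keyed only by the set of classes that have received negative feedback, which is the state-space reduction noted above; for two equally likely types that each like exactly one single-product category and `beta = 0.5`, the sketch returns 0.75.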


¹ Note that if, for some order and conditional on a sequence of classes getting negative feedback, some class
that is next in the order is dominated, then that order will simply not be chosen as the optimal order.
4 Approximately optimal policies

4.1 Policy 1: Farsighted Greedy 

Consider the optimization problem (3), assuming that we have been given $V(Q_k^\pi, p_k^\pi, \beta)$ for each
$k = 1, \dots, K$, and each $\pi \subseteq \{1, \dots, K\}$ such that $k \notin \pi$. Now suppose that instead of optimally
solving (3), we adopt the following ‘greedy’ policy. We iteratively define

\[ k_s^* = \arg\max \{ V_k^{k_1^*, \dots, k_{s-1}^*} : k \in \{1, \dots, K\} \setminus \{k_1^*, \dots, k_{s-1}^*\} \} \]

for $s = 1, \dots, K$. This policy assumes that the payoff-to-go from the ‘next level’ onwards is given.
But since it is not, we recursively compute an approximation to this payoff-to-go by assuming that
we will follow the same greedy strategy in all the subsequent levels of the optimization problem.
Algorithm 3 computes the proposed policy and its payoff.


Algorithm 3 (Farsighted greedy): Function $W(Q, p, \beta)$ where $Q$ is a relevance matrix and $p$ is a
probability distribution over user types.

• If $Q$ is empty, return $W(Q, p, \beta) = 0$.

• If $Q$ is non-empty, enumerate the non-dominated equivalence classes of $Q$. Let them
be $(U_1, \dots, U_K)$. Calculate the number of products in each class, denoted by
$L_k = \sum_{j \in U_k} L_j$.

• Let the event $\omega(\pi, k)$ be as defined in Algorithm 2. Similarly define $Q_k^\pi$ and $p_k^\pi$.

• Iteratively compute

\[ k_s^* = \arg\max \{ W_k^{k_1^*, \dots, k_{s-1}^*} : k \in \{1, \dots, K\} \setminus \{k_1^*, \dots, k_{s-1}^*\} \} \tag{4} \]

where

\[ W_k^\pi = P(\omega(\pi, k)) \left( \frac{1 - \beta^{L_k}}{1 - \beta} + \beta^{L_k} W(Q_k^\pi, p_k^\pi, \beta) \right). \]

• Return

\[ W(Q, p, \beta) = W_{k_1^*} + \beta W_{k_2^*}^{k_1^*} + \dots + \beta^{K-1} W_{k_K^*}^{k_1^*, \dots, k_{K-1}^*}. \]


Note the computational savings as compared to the algorithm for computing the optimal policy. The
comparison in equation (4) when $s$ classes have been presented already is over $K - s$ possibilities in
the worst case. Thus the number of times a sub-program is called is
$K + (K-1) + (K-2) + \dots + 1 = K(K+1)/2$. Thus at each level in this recursive program, the number
of sub-programs that are called is quadratic in the number of equivalence classes at that level. We
can then prove the following performance guarantee for this policy.


Theorem 4.1. Let $L_{\min} = \min_{j = 1, \dots, H} L_j$ be the minimum number of products in any category and
let $H$ be the total number of categories. The farsighted greedy algorithm achieves at least a
$\frac{1 - \beta^{L_{\min}}}{1 + \beta - \beta^{H} - \beta^{L_{\min}}}$ factor of the optimal payoff.


Note that the worst case is when $H$ is large and $L_{\min} = 1$, in which case the farsighted greedy policy
achieves a $1 - \beta$ factor of the optimal payoff. The key idea of the proof is as follows. The departure
from optimality at any level has two sources: the payoff-to-go from the next level onwards is only
an approximation to the optimal payoff-to-go, and the order in which the non-dominated
classes are presented in the current level is chosen greedily. If one assumes that the ratio of the
approximation to the optimal payoff-to-go and the optimal payoff-to-go at the next level is some $\gamma$,
and if one can quantify the departure from optimality of the greedy policy at the current level, one
can compute a bound for the worst case ratio of the current payoff-to-go and the optimal current
payoff-to-go as some $\gamma' = f(\gamma)$. One can show that this operator is a contraction. Thus one can
recursively find a sequence of lower bounds that are uniformly bounded below by the fixed point of
this operator, which is the quantity in the theorem.

Note that the description of Algorithm 3 can be simplified by fully exploiting its recursive structure;
we presented it in the current form to show the correspondence to Algorithm 2 and also to facilitate
the argument in the proof of Theorem 4.1. The equivalent implementation can be found in the
appendix.


4.2 Policy 2: Naive Greedy 

Another simple heuristic that we can use is the following greedy policy. 


Policy (Naive Greedy): Let the set of non-dominated equivalence classes at an experimentation
opportunity $t$ be $(U_1(t), \dots, U_K(t))$, and let $(L_1(t), \dots, L_K(t))$ be the number of products in each of
these classes. Then choose a product from a class $k^*$ where

\[ k^* \in \arg\max_{k \in \{1, \dots, K\}} \frac{1 - \beta^{L_k(t)}}{1 - \beta} \, P(X \in M_{U_k(t)} \mid H_t). \]
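
As a minimal sketch of this selection rule, the Python function below scores each class by its discounted product count times the posterior probability that the class is relevant, and returns the index of the best class. The function and argument names are illustrative assumptions; the posterior probabilities are taken as given inputs rather than computed from a relevance matrix.

```python
def naive_greedy_choice(class_sizes, class_probs, beta):
    """Return the index k* of the class maximizing the naive greedy index.

    class_sizes[k]: number of products L_k(t) in class k (hypothetical input)
    class_probs[k]: posterior probability that class k is relevant (hypothetical input)
    """
    def index(k):
        # (1 - beta^L) / (1 - beta), written as a finite sum so beta = 1 also works
        discounted_count = sum(beta ** t for t in range(class_sizes[k]))
        return discounted_count * class_probs[k]

    return max(range(len(class_sizes)), key=index)
```

With two classes of sizes 1 and 3 and relevance probabilities 0.6 and 0.3, a patient user (`beta = 0.9`) makes the larger class more attractive, whereas an impatient one (`beta = 0.1`) favors the more likely class.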


We then have the following theorem. 

Theorem 4.2. The Naive Greedy policy achieves at least a $\frac{1 - \beta^{L_{\min}}}{1 + \beta - \beta^{H}}$ factor of the optimal payoff.


Note that in the worst case, when $L_{\min} = 1$ and $H$ is large, the greedy algorithm achieves at least a
$\frac{1 - \beta}{1 + \beta}$ factor of the optimal payoff. The proof of this theorem is similar to that of Theorem 4.1.


4.3 A lower bound for $\beta$ close to 1

We can also obtain a lower bound on the ratio of payoffs under either of the heuristic algorithms and
the optimal algorithm for values of $\beta$ close to 1. Intuitively, this follows from the observation that
if the user stays long enough that the number of ad opportunities available is at least $L$, then
any policy obtains all the positive feedback that one can possibly obtain.

Theorem 4.3. Any feasible policy attains a $\beta^{L-1}$ factor of the optimal payoff.
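
To get a feel for how these guarantees complement each other, the short sketch below evaluates two of them numerically: the farsighted greedy bound in the form that the proof of Theorem 4.1 arrives at, $(1 - \beta^{L_{\min}}) / (1 + \beta - \beta^{H} - \beta^{L_{\min}})$, and the $\beta^{L-1}$ bound of Theorem 4.3. The function names and the sample parameter values are illustrative assumptions, and the first function is only defined for $0 \le \beta < 1$.

```python
def farsighted_bound(beta, l_min, h):
    """Lower bound on W/V for the farsighted greedy policy (valid for 0 <= beta < 1)."""
    return (1 - beta ** l_min) / (1 + beta - beta ** h - beta ** l_min)


def any_policy_bound(beta, total_products):
    """Lower bound beta^(L-1) that holds for any feasible policy."""
    return beta ** (total_products - 1)


# Illustrative values only: L_min = 1, H = 60 categories, L = 3 products in total.
for beta in (0.0, 0.5, 0.9):
    print(beta, farsighted_bound(beta, 1, 60), any_policy_bound(beta, 3))
```

The first bound is close to 1 when $\beta$ is small and the second is close to 1 when $\beta$ approaches 1, so together they cover both ends of the range.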


5 Simulations 

In this section we compare the performance of the farsighted greedy policy and the naive
greedy policy with the optimal policy. We generate 50 random samples each of 5 × 5 and 7 × 7
relevance matrices with associated randomly chosen priors. We compute the payoff under all
three policies, for $\beta$ ranging from 0 to 1. For each $\beta$, we then plot the average and the minimum
across the 50 samples of the ratio of the payoff under a non-optimal policy and the optimal policy.
Our results are shown in Figure 4.

Note that both policies perform very close to optimal even in the worst case across the samples.
Also, observe that for $\beta$ close to 0 and for $\beta$ close to 1, the payoff under both policies approaches
the optimal payoff, which corroborates our bounds in Theorems 4.1, 4.2 and 4.3. The curve
corresponding to the naive greedy policy is smooth because the policy does not depend on $\beta$ and hence
the resulting payoff is continuous in $\beta$ (and so is the optimal payoff).

6 Discussion and Conclusions 

Our main contribution in this paper is the introduction and analysis of the sequential relevance
maximization problem with binary feedback. This problem naturally arises in several settings where a
designer needs to adaptively make a sequence of suggestions to a user while learning his preferences
from his feedback. This basic framework is amenable to extensions that adapt our approach
to a more practical setting where some of our assumptions may not hold. For example, we assume
that the number of display opportunities in a session is independent of both the type of the user and
the relevance feedback, which may not hold in practice: a user may be more likely to
leave sooner if he is consecutively shown irrelevant products. Also, one of our central assumptions
is that the user feedback is binary, but in practice one may benefit from a more fine-grained feedback
from the user. For instance, the user may convey a rating for the product which may be a number
from 0 to 5. In this case, one would want to maximize the sum of ratings obtained for the products
shown in a session. Another interesting extension is to incorporate the values of the products so
that one maximizes the total value of relevant products shown to a user in a session. We leave these
extensions for future work.

Figure 4: Average and worst case ratios of payoffs under either of the heuristic policies and the
optimal policy across 50 samples.

User type and personalization: In the current era of personalization of web services, it is important
that a recommendation system be sensitive to the transience in the preferences of the users. For
example, a user’s preference for music can change every day, depending on her mood, company 
etc. A sequential optimization approach to generating recommendations can proactively learn these 
changes in user preferences by freshly eliciting relevance feedback for carefully chosen products, 
each time the user enters the system. 

In the model that we have considered in this paper, the type of a user captures her preferences for 
the session under consideration and the prior distribution over these types is assumed to be known 
to the system designer. One interpretation of this distribution is that it captures the preferences of a 
‘typical’ user in the population, and it is estimated from the observed behavior of all the past users. 
In another interpretation aligned with the notion of personalization, one can think of this distribution 
as capturing the variation in the preferences of the same user over multiple sessions. For example, 
in a naive interpretation, one can imagine that the type captures the ‘mood’ of a person, which is 
sampled independently every day, and her preferences for music on a particular day depend on her
mood on that day. Even more generally, there could be cross-temporal dependencies in these types. 
If one desires to optimize the performance of the recommendation process over multiple sessions, 
one needs to also estimate this type evolution process. We leave these considerations for future 
work. 
7 Appendix

7.1 Proof of Lemma 3.1

Proof. The result follows from a simple interchange argument. Suppose at time $t$, the posterior
distribution over the set of possible types is $\{P(X = i \mid H_t)\} = (p_1^t, \dots, p_N^t)$ and the set of
remaining products is $A(t)$. Consider the event $W$ with a fixed realization of the user type $X = i$
and a fixed realization of the random variable $C = c$, which is the time at which the user leaves.
Thus on this event, the string of binary feedback for the different categories is $\{q_j^i : j \in [H]\}$. Then
for any fixed policy $\psi'$, the sequence of allocations of the products from time $t$ onwards till time $c$
is dictated by the policy and is determinate. Let this sequence of allocations be $\{l_t, l_{t+1}, \dots, l_c\}$
and the corresponding sequence of feedback be $\{y_{j(l_t)}, y_{j(l_{t+1})}, \dots, y_{j(l_c)}\}$. Suppose there exists
an advertiser $l^* \in A(t)$ such that, conditional on observations till time $t - 1$, $Y_{j(l^*)} = 1$ w.p. 1. We
now consider 2 cases:

Case 1: assume that on event $W$, for policy $\psi'$, $l^* \in \{l_t, l_{t+1}, \dots, l_c\}$. Say $l_{t'} = l^*$ for
$t' \in \{t, \dots, c\}$. Then if $t' \neq t$, we will construct a policy $\psi$ which generates the sequence of allocations
$\{l^*, l_t, \dots, l_{t'-1}, l_{t'+1}, \dots, l_c\}$ and thus gives the same payoff on event $W$. This policy $\psi$ is the
following:


1. Allot advertiser $l^*$ at time $t$.

2. From time $t + 1$ onwards follow policy $\psi'$ assuming $H_{t+1} = H_t$.

3. When $\psi'$ prescribes allotting $l^*$ at time $t' + 1$, use the information $Y_{j(l^*)} = 1$ to update the
history to $H_{t'+1}$ and remove $l^*$ from the set of available advertisers. Follow the prescription
of $\psi'$ for allotting advertisers from $t' + 1$ and onwards.


Clearly, this policy gives the same payoff on event $W$ since only the time at which $l^*$ is allotted has
been interchanged.

Case 2: assume that on event $W$, for policy $\psi'$, $l^* \notin \{l_t, l_{t+1}, \dots, l_c\}$. Then observe that the policy
$\psi$ described above generates the sequence of allocations $\{l^*, l_t, l_{t+1}, \dots, l_{c-1}\}$. Thus the difference in
payoff is given by

\[ Y_{j(l^*)} - Y_{j(l_c)} = 1 - Y_{j(l_c)} \ge 0. \]

Thus the number of relevant ads shown under policy $\psi$ is at least as high as that under policy $\psi'$ for
every such event $W$ from a set of disjoint events whose union is the entire probability space. Thus
the expected number of relevant ads shown is also at least as high. □
7.2 Proof of Lemma 3.2


Proof. The proof uses a probabilistic interchange argument. Suppose that at opportunity $t$, the
optimal policy $\psi'$ allots a product $l'$ of category $j'$ while there exists a category $j$ such that
$M_{j'}(t) \subset M_j(t)$. Let $l$ be any generic product of category $j$. Now consider a policy $\psi$ which allots
category $j$ before category $j'$ by allotting product $l$ at opportunity $t$. To further describe this policy,
we consider two cases:


(A) If the user finds $j$ relevant, it exhausts all the products in that category and then moves on
to allotting $j'$. After it allots $j'$ it behaves as if $j$ was never allotted until $\psi'$ prescribes
allotting $j$, upon which the designer updates the information that $j$ is relevant and moves
on to allot the next category prescribed by $\psi'$ and so on.

(B) If the user finds $j$ not relevant, then from time $t + 1$ onwards it acts as if it allotted $j'$ at time
$t$ and found that $j'$ is not relevant. Then when $\psi'$ prescribes allotting $j$, the designer updates
the information that $j$ is not relevant and moves on to allot the next category prescribed by
$\psi'$ and so on.

We will show that on every disjoint event of the underlying probability space, the system designer
shows at least as many relevant products by following policy $\psi$ instead of $\psi'$. Consider an event
$W$ on which the user leaves after opportunity $c \ge t$ and on which the realization of the user
type is $X = i$. Then for the policy $\psi'$, the sequence of allocations of the products from time $t$
onwards until time $c$ is dictated by the policy and is determinate. Let this sequence of allocations
be $\{l_t, l_{t+1}, \dots, l_c\}$ and the corresponding sequence of feedback be $\{y_{j(l_t)}, y_{j(l_{t+1})}, \dots, y_{j(l_c)}\}$.
These allocations and feedback depend on the type $i$ that was realized on $W$. For this, we consider
3 mutually exclusive and exhaustive cases:

Case 1: We first consider the case where on the event $W$, $X = i \in M_{j'}(t)$. Then observe
that $y_{j(l_t)} = y_{j(l')} = 1$ and immediately the designer deduces that $Y_j = Y_{j'} = 1$. Thus since
$\psi'$ is optimal, by the previous lemma w.l.o.g. the first $L_{j'} + L_j$ allocations in the sequence
$\{l_t, l_{t+1}, \dots, l_c\}$ can be assumed to be all the advertisers belonging to categories $j'$ and $j$ and thus
the feedback is a sequence of $L_{j'} + L_j$ 1s. Note that policy $\psi$ will be operating under case (A) and
thus it will also generate a sequence of allocations in which the first $L_{j'} + L_j$ allocations in the
sequence $\{l_t, l_{t+1}, \dots, l_c\}$ will be all the products belonging to categories $j'$ and $j$ (in a different
order) and after that the rest of the sequence of allocations is identical to that under $\psi'$. Thus on
such an event $W$ both the policies $\psi'$ and $\psi$ generate the same sequence of relevance feedback.

Case 2: Let us now consider the case where on the event $W$, $X = i \in M_j(t) - M_{j'}(t)$.
In this case $y_{j(l_t)} = y_{j(l')} = 0$ but $Y_j = 1$. The policy $\psi'$ generates the sequence of
allocations $\{l', l_{t+1}, \dots, l_c\}$ and gets the feedback $\{0, y_{j(l_{t+1})}, \dots, y_{j(l_c)}\}$, whereas
the policy $\psi$ operates under case (A): the designer discovers that $Y_j = 1$ by allotting $l$ and
then continues to exhaust all the products in $j$ before switching to the prescriptions of $\psi'$. Thus
$\psi$ generates a sequence of allocations in which all the products in $j$ are allotted first and then the
prescription of $\psi'$ is followed as described in case (A). In the case where $\psi'$ prescribed allotting
$j$ at some opportunity and was able to allot all the products in $j$ until the final opportunity $c$, this
leads to a sequence of allocations which is just a different ordering of the elements of the sequence
$\{l', l_{t+1}, \dots, l_c\}$ and thus generates the same number of relevant products shown until time $c$. In
the case where $\psi'$ allotted $0 \le r < L_j$ products of category $j$ up until the final opportunity $c$, then
under policy $\psi$, the last $L_j - r$ products in the sequence $\{0, y_{j(l_{t+1})}, \dots, y_{j(l_c)}\}$ are dropped out in
lieu of the same number of products in category $j$ in the beginning. But since all products in $j$ are
relevant, the number of relevant ads under policy $\psi$ is still at least as high as that under $\psi'$.

Case 3: Now consider the case where on the event $W$, $X = i \in S(t) - M_j(t)$. In this
case $y_{j(l_t)} = y_{j(l')} = 0$ and also $Y_j = 0$. Thus the policy $\psi'$ generates the sequence of allocations
$\{l', l_{t+1}, \dots, l_c\}$ and gets the feedback $\{0, y_{j(l_{t+1})}, \dots, y_{j(l_c)}\}$. In this case $\psi$ operates under case
(B). Now in the case where $l \notin \{l', l_{t+1}, \dots, l_c\}$ for any product $l$ in category $j$, $\psi$ generates the
same sequence of feedback $\{0, y_{j(l_{t+1})}, \dots, y_{j(l_c)}\}$. In the case where $l \in \{l', l_{t+1}, \dots, l_c\}$ for
some product $l$ in category $j$, then under $\psi$, since $j$ has already been tested in the beginning, the
negative feedback of category $j$ is not repeated by re-allotting it. In lieu of that the policy moves on
and a new feedback is obtained at the end which may be 1 or 0. Thus $\psi$ generates at least as many
relevant recommendations as $\psi'$.

Thus the number of relevant products shown under policy ip is at least as high as that under policy 
ip' for every such event W from a set of disjoint events whose union is the entire probability space. 
Thus the expected number of relevant products shown is also at least as high. 

□ 


7.3 Proof of Theorem 4.1


Proof. First note that if the relevance matrix $Q$ is such that all the categories form a single
non-dominated equivalence class, then the farsighted greedy policy is the same as the optimal policy
and so $W(Q, p, \beta) = V(Q, p, \beta)$. Now consider an experimentation opportunity with an associated
relevance matrix $Q$ that has $K$ non-dominated equivalence classes $(U_1, \dots, U_K)$. Further assume
that there is some factor $0 < \gamma \le 1$ such that

\[ \frac{W(Q_k^\pi, p_k^\pi, \beta)}{V(Q_k^\pi, p_k^\pi, \beta)} \ge \gamma \]

for each $k = 1, \dots, K$, and each $\pi \subseteq \{1, \dots, K\}$ such that $k \notin \pi$. Now we have

\[ W_k^\pi = P(\omega(\pi, k)) \left( \frac{1 - \beta^{L_k}}{1 - \beta} + \beta^{L_k} W(Q_k^\pi, p_k^\pi, \beta) \right)
 \ge P(\omega(\pi, k)) \left( \frac{1 - \beta^{L_k}}{1 - \beta} + \gamma \beta^{L_k} V(Q_k^\pi, p_k^\pi, \beta) \right). \tag{5} \]


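We thus have

\[ \frac{W_k^\pi}{V_k^\pi} \ge \frac{\frac{1 - \beta^{L_k}}{1 - \beta} + \gamma \beta^{L_k} V(Q_k^\pi, p_k^\pi, \beta)}{\frac{1 - \beta^{L_k}}{1 - \beta} + \beta^{L_k} V(Q_k^\pi, p_k^\pi, \beta)}. \tag{6} \]

Now it can be easily verified that for a positive constant $c$, the function $f(x) = \frac{c + \gamma x}{c + x}$ is strictly
decreasing in $x$. Thus, since $V(Q_k^\pi, p_k^\pi, \beta) \le \frac{1}{1 - \beta}$, we have that

\[ \frac{W_k^\pi}{V_k^\pi} \ge \frac{\frac{1 - \beta^{L_k}}{1 - \beta} + \frac{\gamma \beta^{L_k}}{1 - \beta}}{\frac{1 - \beta^{L_k}}{1 - \beta} + \frac{\beta^{L_k}}{1 - \beta}}
 = 1 - (1 - \gamma) \beta^{L_k} \ge 1 - (1 - \gamma) \beta^{L^{\min}}, \tag{7} \]

where $L^{\min} = \min_k L_k$. This bound holds uniformly for each $k = 1, \dots, K$, and each
$\pi \subseteq \{1, \dots, K\}$ such that $k \notin \pi$. Now if we define

\[ OPT' = \max_{(k_1, \dots, k_K) \in \sigma(1, \dots, K)} W_{k_1} + \beta W_{k_2}^{k_1} + \dots + \beta^{K-1} W_{k_K}^{k_1, \dots, k_{K-1}}, \tag{8} \]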


then from (7), one can easily show that

\[ \frac{OPT'}{V(Q, p, \beta)} \ge 1 - (1 - \gamma) \beta^{L^{\min}}. \]

Now we will show that the farsighted greedy algorithm attains a $\frac{1}{1 + \beta - \beta^{K}}$ factor of $OPT'$.
To show this we will use induction
in the dynamic programming problem that solves (8). Let $\alpha_i$ be the lower bound on the ratio of
the payoff to go under the greedy policy and that under the optimal policy when the number of
classes left is $i$, where $i$ varies from 1 to $K$ in the problem (8). We are interested in proving that
$\alpha_K \ge \frac{1}{1 + \beta - \beta^{K}}$. Now if $k_1, \dots, k_{K-1}$ is decided then there is only one option left for $k_K$ and
hence the greedy policy gives the same payoff as the optimal payoff to go. Thus $\alpha_1 = 1$. Now fix
an $i \ge 2$ and consider the payoff to go under the optimal policy when $K - i$ classes in the order
have been selected. Let the set of these classes already selected be denoted by labels in $\pi$ and denote
this optimal payoff to go by $G_{OPT'}^{\pi}$. Denote the payoff to go under the greedy policy by $G_g^{\pi}$. Let
the class selected by the greedy policy next be $U_k$ for $k \in \{1, \dots, K\} \setminus \pi$. Then we have by the
definition of $\alpha_{i-1}$,

\[ G_g^{\pi} = W_k^\pi + \beta G_g^{\pi \cup k} \ge W_k^\pi + \beta \alpha_{i-1} G_{OPT'}^{\pi \cup k}. \tag{9} \]

Now first we have

\[ G_{OPT'}^{\pi} = \max_{j \in \{1, \dots, K\} \setminus \pi} W_j^\pi + \beta G_{OPT'}^{\pi \cup j}. \tag{10} \]

Suppose now that a genie reveals the feedback for a class $U_k$ for free at this point. Then the optimal
payoff under this new information is at least as high as the optimal payoff if this information is not
available, i.e. $G_{OPT'}^{\pi}$ (because one can always choose to ignore the genie). Denote this optimal payoff
under the new information structure as $\tilde{G}_{OPT'}^{\pi}$. Then under this new information, clearly if it is
revealed that the feedback for $U_k$ is positive then one exhausts all the advertisers in $U_k$, whereas if
the feedback is negative then one removes $U_k$ from the set of classes and moves on without wasting
any opportunity on testing $U_k$. Thus

\[ \tilde{G}_{OPT'}^{\pi} = W_k^\pi + G_{OPT'}^{\pi \cup k} \ge G_{OPT'}^{\pi}. \tag{11} \]

And we thus have

\[ G_{OPT'}^{\pi \cup k} \ge G_{OPT'}^{\pi} - W_k^\pi. \tag{12} \]

Substituting (12) in (9) we have

\[ G_g^{\pi} \ge W_k^\pi + \beta \alpha_{i-1} (G_{OPT'}^{\pi} - W_k^\pi) = W_k^\pi (1 - \beta \alpha_{i-1}) + \beta \alpha_{i-1} G_{OPT'}^{\pi}. \tag{13} \]

Further observe that since the greedy policy chooses $k$, we have $G_{OPT'}^{\pi} \le W_k^\pi \frac{1 - \beta^{i}}{1 - \beta}$,
or $W_k^\pi \ge G_{OPT'}^{\pi} \frac{1 - \beta}{1 - \beta^{i}}$, and thus we have

\[ \frac{G_g^{\pi}}{G_{OPT'}^{\pi}} \ge \frac{(1 - \beta)(1 - \beta \alpha_{i-1})}{1 - \beta^{i}} + \beta \alpha_{i-1}
 \ge \frac{(1 - \beta)(1 - \beta \alpha_{i-1})}{1 - \beta^{K}} + \beta \alpha_{i-1}. \]

Here the second inequality follows since $i \le K$. Now consider the recurrence equation

\[ \alpha_i = \frac{(1 - \beta)(1 - \beta \alpha_{i-1})}{1 - \beta^{K}} + \beta \alpha_{i-1}. \tag{14} \]


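We have that $\alpha_i < \alpha_{i-1}$ for $\alpha_{i-1} > \frac{1}{1 + \beta - \beta^{K}}$, and hence the sequence $\{\alpha_i\}$ generated by the
recurrence relation, with $\alpha_1 = 1$, is decreasing as long as $\alpha_i > \frac{1}{1 + \beta - \beta^{K}}$. Further, we can verify that
for $\alpha_{i-1} > \frac{1}{1 + \beta - \beta^{K}}$ we have

\[ \alpha_{i-1} - \alpha_i = \alpha_{i-1}(1 - \beta) - \frac{(1 - \beta)(1 - \beta \alpha_{i-1})}{1 - \beta^{K}} < \alpha_{i-1} - \frac{1}{1 + \beta - \beta^{K}}. \]

Thus we can conclude that the sequence $\{\alpha_i\}$ is uniformly bounded below by $\alpha^* = \frac{1}{1 + \beta - \beta^{K}}$, which
is the fixed point of the recurrence equation. Thus we have that $\alpha_K \ge \frac{1}{1 + \beta - \beta^{K}}$, which is what we
desired to prove.

Thus after combining the bounds, we have that

\[ \frac{W(Q, p, \beta)}{V(Q, p, \beta)} \ge \frac{1 - (1 - \gamma) \beta^{L^{\min}}}{1 + \beta - \beta^{K}}. \tag{15} \]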


Let $L_{\min}$ be the minimum number of products in any category in $L$, i.e. $L_{\min} = \min_{j = 1, \dots, H} L_j$.
Now since $L_{\min} \le L^{\min}$ and $K \le H$, which is the total number of categories, we have the
following bound that holds irrespective of the relevance matrix $Q$ at any given level:

\[ \frac{W(Q, p, \beta)}{V(Q, p, \beta)} \ge \frac{1 - (1 - \gamma) \beta^{L_{\min}}}{1 + \beta - \beta^{H}}. \tag{16} \]
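Now let $\gamma_1 = 1$ and for $i \ge 2$ consider the recurrence equation

\[ \gamma_i = \frac{1 - (1 - \gamma_{i-1}) \beta^{L_{\min}}}{1 + \beta - \beta^{H}}. \tag{17} \]

Now $\gamma_i < \gamma_{i-1}$ as long as $\gamma_{i-1} > \frac{1 - \beta^{L_{\min}}}{1 + \beta - \beta^{H} - \beta^{L_{\min}}}$. Further we can verify that for
$\gamma_{i-1} > \frac{1 - \beta^{L_{\min}}}{1 + \beta - \beta^{H} - \beta^{L_{\min}}}$,

\[ \gamma_{i-1} - \gamma_i = \gamma_{i-1} - \frac{1 - (1 - \gamma_{i-1}) \beta^{L_{\min}}}{1 + \beta - \beta^{H}}
 < \gamma_{i-1} - \frac{1 - \beta^{L_{\min}}}{1 + \beta - \beta^{H} - \beta^{L_{\min}}}. \]

We can thus conclude that the sequence $\{\gamma_i\}$ is uniformly bounded below by
$\gamma^* = \frac{1 - \beta^{L_{\min}}}{1 + \beta - \beta^{H} - \beta^{L_{\min}}}$,
which is the fixed point of the recurrence relation. Thus for any $(Q, p, \beta)$,

\[ \frac{W(Q, p, \beta)}{V(Q, p, \beta)} \ge \frac{1 - \beta^{L_{\min}}}{1 + \beta - \beta^{H} - \beta^{L_{\min}}}. \tag{18} \]

□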
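
As a numerical sanity check of the two fixed-point arguments in this proof, the sketch below iterates recurrences of the form (14) and (17) from the starting value 1 and compares the iterates with the closed-form fixed points $1/(1 + \beta - \beta^{K})$ and $(1 - \beta^{L_{\min}})/(1 + \beta - \beta^{H} - \beta^{L_{\min}})$. The parameter values and all names are illustrative assumptions, not taken from the paper's experiments.

```python
def iterate(step, start, n):
    """Return the first n terms of the sequence x_1 = start, x_i = step(x_{i-1})."""
    values = [start]
    for _ in range(n - 1):
        values.append(step(values[-1]))
    return values


# Illustrative parameters (assumed for this sketch).
beta, K, H, L_min = 0.7, 4, 6, 2


def alpha_step(a):
    # Recurrence (14)
    return (1 - beta) * (1 - beta * a) / (1 - beta ** K) + beta * a


def gamma_step(g):
    # Recurrence (17)
    return (1 - (1 - g) * beta ** L_min) / (1 + beta - beta ** H)


alpha_star = 1 / (1 + beta - beta ** K)
gamma_star = (1 - beta ** L_min) / (1 + beta - beta ** H - beta ** L_min)

alphas = iterate(alpha_step, 1.0, 200)
gammas = iterate(gamma_step, 1.0, 200)
print(alphas[-1], alpha_star)
print(gammas[-1], gamma_star)
```

Both sequences decrease from 1 and settle at the respective fixed points without crossing them, which is the behavior the proof relies on.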


7.4 Proof of Theorem 4.2


Proof. Consider the set $(U_1, \dots, U_K)$ of the non-dominated classes of ad categories at the first
experimentation opportunity. From the dynamic programming equation (3) we have

\[ V_k^\pi = P(\omega(\pi, k)) \left( \frac{1 - \beta^{L_k}}{1 - \beta} + \beta^{L_k} \bar{V}_k^\pi \right). \]

Here $L_k$ as defined before is the number of ads in class $k$ and $\bar{V}_k^\pi$ is the optimal payoff-to-go
conditional on the event $\omega(\pi, k)$ given that class $k$ is also used up. We will approximate this payoff by $\mu_k^\pi$
defined as

\[ \mu_k^\pi = P(\omega(\pi, k)) \left( \frac{1 - \beta^{L_k}}{1 - \beta} \right). \]

Note that under the greedy policy, $k$ is chosen to maximize $\mu_k^\pi$. The ratio of the two quantities is

\[ \frac{\mu_k^\pi}{V_k^\pi} = \frac{\frac{1 - \beta^{L_k}}{1 - \beta}}{\frac{1 - \beta^{L_k}}{1 - \beta} + \beta^{L_k} \bar{V}_k^\pi} \ge 1 - \beta^{L_k} \ge 1 - \beta^{L_{\min}}. \tag{19} \]

Here the first inequality follows since $\bar{V}_k^\pi \le \frac{1}{1 - \beta}$, and the second follows from the definition of
$L_{\min}$, since Lemma 3.3 says that the number of ads in a class can only grow. We will later show in an
example that for our greedy policy, this bound is tight. Now the optimal policy finds the best order
in which to present the non-dominated equivalence classes, which solves the following optimization
problem.

\[ OPT = \max_{k_1, \dots, k_K} V_{k_1} + \beta V_{k_2}^{k_1} + \beta^2 V_{k_3}^{k_1, k_2} + \dots + \beta^{K-1} V_{k_K}^{k_1, \dots, k_{K-1}}. \tag{20} \]

Consider instead

\[ OPT' = \max_{k_1, \dots, k_K} \mu_{k_1} + \beta \mu_{k_2}^{k_1} + \beta^2 \mu_{k_3}^{k_1, k_2} + \dots + \beta^{K-1} \mu_{k_K}^{k_1, \dots, k_{K-1}}. \tag{21} \]

Clearly (19) implies that

\[ \frac{OPT'}{OPT} \ge 1 - \beta^{L_{\min}}. \]

Now using the same arguments as those used in the proof of Theorem 4.1 for the optimization problem
in definition (8), we can show that the greedy algorithm attains a $\frac{1}{1 + \beta - \beta^{K}}$ factor of the optimum
in (21). Since $K \le H$, the result follows.


□


□ 
7.5 Proof of Theorem 4.3


Proof. For a user of type $i$, the total number of products with positive feedback is given by
$r_i = \sum_{j=1}^{H} q_j^i L_j$. Thus the expected total number of products with positive feedback is

\[ R = \sum_{i=1}^{N} r_i P_X(i). \]

On the event $W$ that the number of display opportunities is at least $L$, any policy obtains the
full payoff of $R$. Thus its expected payoff is bounded by

\[ V_g \ge P(W) R = P(C \ge L) R = \beta^{L-1} R. \]

Further, the optimal policy cannot attain a payoff greater than $R$. Thus the ratio of the payoff under
any policy and that under the optimal policy is at least $\beta^{L-1}$. □

7.6 Recursive computation of Farsighted Greedy 


Algorithm 4 (Farsighted greedy): Function $[W(Q, p, \beta), A(Q, p, \beta)]$ where $Q$ is a relevance
matrix and $p$ is a probability distribution over user types.

• If $Q$ is empty, return $W(Q, p, \beta) = 0$.

• If $Q$ is non-empty, let the non-dominated equivalence classes be $(U_1, \dots, U_K)$ and the
number of products in each class be denoted by $L_k = \sum_{j \in U_k} L_j$. Let $N$ be the number of
rows in $Q$ corresponding to the user types.

• Let the event $\omega(k)$ be the event $\{X \in S(k)\}$ where

\[ S(k) = \{ i \in [N] : q_j^i = 1 \ \forall j \in U_k \}. \]

Let $Q_k$ be the matrix obtained after removing all the columns corresponding to categories
in $U_k$ and the rows corresponding to all the user types in $S(k)^c$, and let $Q_k^{res}$ be the matrix
obtained after removing all the columns corresponding to categories in $U_k$ and the rows
corresponding to all the user types in $S(k)$. Finally, let $p_k$ denote the distribution on the
user types conditional on the event $\{X \in S(k)\}$ and $p_k^{res}$ be the distribution on the user
types conditional on $\{X \in S(k)^c\}$.

• Then define

\[ W_k = P(S(k)) \left( \frac{1 - \beta^{L_k}}{1 - \beta} + \beta^{L_k} W(Q_k, p_k, \beta) \right). \]

• Let $W^* = \max_k W_k$ and let $k^* \in \arg\max_k W_k$.

• Return

\[ W(Q, p, \beta) = W^* + \beta (1 - P(S(k^*))) W(Q_{k^*}^{res}, p_{k^*}^{res}, \beta), \qquad A(Q, p, \beta) = k^*. \]
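
The recursion of Algorithm 4 can be sketched in Python as follows. This is an illustrative sketch under assumptions rather than the authors' code: the data layout (`Q[i][j]` binary, `p[i]` prior of type `i`, `L[j]` products in category `j`), the helper names, and the strict-subset test for domination are all choices made here, and no memoization is attempted.

```python
def farsighted_greedy(Q, p, L, beta):
    """Return (W, first_class): approximate payoff and the categories shown first."""

    def geo(n):
        # 1 + beta + ... + beta^(n-1), i.e. (1 - beta^n) / (1 - beta)
        return sum(beta ** t for t in range(n))

    def classes(types, cats):
        # Equivalence classes keyed by the set of types finding them relevant;
        # classes whose type set is strictly contained in another's are dropped.
        groups = {}
        for j in cats:
            key = frozenset(i for i in types if Q[i][j] == 1)
            groups.setdefault(key, []).append(j)
        return [(S, js) for S, js in groups.items()
                if not any(S < T for T in groups)]

    def W(types, cats):
        mass = sum(p[i] for i in types)
        if not cats or mass == 0:
            return 0.0, None
        best = None
        for S, js in classes(types, cats):
            prob = sum(p[i] for i in S) / mass        # P(S(k)) under the posterior
            rest = tuple(j for j in cats if j not in js)
            value = 0.0
            if prob > 0:
                Lk = sum(L[j] for j in js)
                value = prob * (geo(Lk) + beta ** Lk * W(tuple(sorted(S)), rest)[0])
            if best is None or value > best[0]:
                best = (value, S, js, prob, rest)
        value, S, js, prob, rest = best
        others = tuple(i for i in types if i not in S)  # types in S(k*)^c
        total = value + beta * (1 - prob) * W(others, rest)[0]
        return total, js

    all_types = tuple(i for i in range(len(p)) if p[i] > 0)
    return W(all_types, tuple(range(len(L))))
```

For two equally likely types that each like exactly one single-product category and `beta = 0.5`, the sketch returns a payoff of 0.75 with category 0 shown first; ties between classes are broken in favor of the class encountered first.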